Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix more Unicode bugs #24182

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented Nov 4, 2024

  • Use Latin-1 in many native file write rules for consistency with the internal encoding.
  • Use Latin-1 for the resolved repository file and the JSON profile.
  • Fix unused_input_list handling of non-ASCII characters in file names.
  • Flip the legacy_utf8 parameter of repository_ctx.file to False and make it a no-op. With the previous default, any non-ASCII characters would be written out as double encoded UTF-8, which is not a useful choice.
  • Change repository_ctx.template to operate on raw bytes for consistency with repository_ctx.read and to fix substitution with non-ASCII keys/values.
  • Move some usages of UTF_8 closer to their usage site to clarify why they are correct.

@fmeum fmeum force-pushed the 23859-unicode-starlark branch 5 times, most recently from d539093 to b378d1f Compare November 5, 2024 14:29
@fmeum fmeum requested a review from tjgq November 5, 2024 15:55
@fmeum fmeum marked this pull request as ready for review November 5, 2024 15:55
@fmeum fmeum requested review from aranguyen and removed request for a team November 5, 2024 15:55
@github-actions github-actions bot added team-Performance Issues for Performance teams team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Rules-Java Issues for Java rules team-Rules-CPP Issues for C++ rules team-Local-Exec Issues and PRs for the Execution (Local) team team-Remote-Exec Issues and PRs for the Execution (Remote) team awaiting-review PR is awaiting review from an assigned reviewer labels Nov 5, 2024
@fmeum
Copy link
Collaborator Author

fmeum commented Nov 5, 2024

@tjgq This is stacked on #24010, but the last commit is new and ready for review.

@tjgq
Copy link
Contributor

tjgq commented Nov 7, 2024

@fmeum Mind rebasing before I review?

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

@tjgq Rebased! Let me know if you would prefer me to split this up further to simplify the review.

@fmeum
Copy link
Collaborator Author

fmeum commented Nov 7, 2024

@bazel-io fork 8.0.0

Comment on lines +1331 to +1332
No-op. This parameter is deprecated and will be removed in a future version of \
Bazel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure we add this to the release notes when we cherry-pick it.

(Ordinarily, I would ask you to add a RELNOTES[INC] to the PR description, but I suspect that's counter-productive: we don't have a process to pick up release notes from CLs submitted after the branch is cut, so it would probably end up in the release notes for 9.0.0 instead.)

@@ -182,7 +182,7 @@ Builder setQuiet(boolean quiet) {

private static String toString(ByteArrayOutputStream stream) {
try {
return new String(stream.toByteArray(), UTF_8);
return stream.toString(UTF_8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this one not ISO_8859_1, matching the other branches in execute?

String tpl = FileSystemUtils.readContent(t.getPath(), StandardCharsets.UTF_8);
// Read and write files as raw bytes by using the Latin-1 encoding, which matches the encoding
// used by Bazel for strings.
String tpl = FileSystemUtils.readContent(t.getPath(), StandardCharsets.ISO_8859_1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Static import (also below)

Comment on lines +30 to +32
public static String encode(String username, String password) {
return "Basic "
+ Base64.getEncoder().encodeToString((username + ":" + password).getBytes(UTF_8));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be better for NetrcParser to decode the username/password into Latin1 and use Latin1 here? This would avoid creating an exception to the rule "strings in the Bazel heap are raw bytes".

(Alternatively, add a comment noting the exception.)

@@ -138,8 +138,7 @@ private void createVendorFile(Path vendorPath, Path vendorFilePath)
throws VendorFileFunctionException {
try {
vendorPath.createDirectoryAndParents();
byte[] vendorFileContents = VENDOR_FILE_HEADER.getBytes(UTF_8);
FileSystemUtils.writeContent(vendorFilePath, vendorFileContents);
FileSystemUtils.writeContent(vendorFilePath, StandardCharsets.UTF_8, VENDOR_FILE_HEADER);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is absolutely correct, but is an interesting exception to the "strings are raw bytes" rule (literals in source code). Let's also add a comment.

Also, static import.

@@ -171,7 +171,7 @@ public FilterShowIncludesOutputStream(OutputStream out, String sourceFileName) {
public void write(int b) throws IOException {
buffer.write(b);
if (b == NEWLINE) {
String line = buffer.toString(StandardCharsets.UTF_8.name());
String line = buffer.toString(StandardCharsets.UTF_8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this an exception to the "strings are raw bytes" rule?

Specifically, won't these strings later be compared against filenames coming from BUILD/bzl files, which are raw bytes?

setup_starlark_repository

if "$is_windows"; then
# äöüÄÖÜß in UTF8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "UTF-8" (also below)

@@ -242,6 +250,43 @@ function test_dependency_pruning_scenario() {
check_unused_content "pkg/c.input"
}

function test_dependency_pruning_scenario_unicode() {
if "$is_windows"; then
# äöüÄÖÜß in UTF8
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: "UTF-8" (also below)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
awaiting-review PR is awaiting review from an assigned reviewer team-Configurability platforms, toolchains, cquery, select(), config transitions team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. team-Local-Exec Issues and PRs for the Execution (Local) team team-Performance Issues for Performance teams team-Remote-Exec Issues and PRs for the Execution (Remote) team team-Rules-CPP Issues for C++ rules team-Rules-Java Issues for Java rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants